228 research outputs found

    01 Text Processing 1 - Data Mining - Ingegneria e Scienze Informatiche, Cesena

    Structured, semi-structured, and unstructured data; information retrieval and text mining; document representation; Boolean retrieval models; the document indexing process; tokenization, normalization, lemmatization, stemming algorithms; searching with indexes; other search optimizations.
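
    Purely as an illustration of the indexing topics listed above (tokenization, normalization, stemming, Boolean retrieval), here is a minimal Python sketch; the naive stemmer and the toy documents are stand-ins, not material from the slides.

```python
# Minimal illustrative sketch: tokenization, normalization, a naive stemmer,
# and an in-memory inverted index with Boolean AND retrieval.
import re
from collections import defaultdict

def tokenize(text):
    # Lowercase and split on non-alphanumeric characters (normalization + tokenization).
    return [t for t in re.split(r"[^a-z0-9]+", text.lower()) if t]

def naive_stem(token):
    # Toy suffix-stripping stemmer, standing in for a real algorithm such as Porter's.
    for suffix in ("ing", "ed", "es", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

def build_inverted_index(docs):
    # Map each stemmed term to the sorted list of document ids containing it.
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in tokenize(text):
            index[naive_stem(token)].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}

def boolean_and(index, term_a, term_b):
    # Boolean retrieval: intersect the postings lists of two query terms.
    return sorted(set(index.get(naive_stem(term_a), [])) &
                  set(index.get(naive_stem(term_b), [])))

docs = {1: "Indexing structured documents", 2: "Stemming and indexing of texts"}
index = build_inverted_index(docs)
print(boolean_and(index, "indexing", "documents"))  # -> [1]
```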

    02 Dimensionality Reduction and LSA - Data Mining - Ingegneria e Scienze Informatiche, Cesena

    Feature selection with Mutual Information and the Chi-squared test; Latent Semantic Analysis (LSA).
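
    As a generic illustration of the two techniques named above, the following scikit-learn sketch applies chi-squared feature selection and then LSA via truncated SVD to a toy corpus; the corpus, labels, and parameter values are invented.

```python
# Illustrative only: chi-squared feature selection followed by LSA (truncated SVD).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.decomposition import TruncatedSVD

docs = [
    "the market price moved up today",
    "stock prices fell on market news",
    "the team won the football match",
    "a great football game and a late goal",
]
labels = [0, 0, 1, 1]  # toy classes: finance vs. sport

# TF-IDF bag-of-words representation of the documents.
X = TfidfVectorizer().fit_transform(docs)

# Keep the k terms most associated with the class labels
# (mutual_info_classif could be used in place of chi2).
X_selected = SelectKBest(chi2, k=6).fit_transform(X, labels)

# LSA: project the reduced term space onto 2 latent dimensions.
X_lsa = TruncatedSVD(n_components=2, random_state=0).fit_transform(X_selected)
print(X_lsa.shape)  # -> (4, 2)
```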

    05 Lab exercise on text classification in WEKA - Data Mining - Ingegneria e Scienze Informatiche, Cesena


    Discriminative Marginalized Probabilistic Neural Method for Multi-Document Summarization of Medical Literature

    Although current state-of-the-art Transformer-based solutions have succeeded in a wide range of single-document NLP tasks, they still struggle to address multi-input tasks such as multi-document summarization. Many solutions truncate the inputs, thus ignoring potentially summary-relevant content, which is unacceptable in the medical domain, where every piece of information can be vital. Others rely on linear model approximations to concatenate multiple inputs, worsening the results because all information is considered, even when it is conflicting or noisy with respect to a shared background. Despite the importance and social impact of medicine, there are no ad-hoc solutions for multi-document summarization. For this reason, we propose a novel discriminative marginalized probabilistic method (DAMEN), trained to discriminate critical information from a cluster of topic-related medical documents and to generate a multi-document summary via token probability marginalization. Results show that we outperform the previous state of the art on a biomedical dataset for multi-document summarization of systematic literature reviews. Moreover, we perform extensive ablation studies to motivate the design choices and prove the importance of each module of our method.
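
    The abstract does not describe DAMEN's architecture, so the snippet below is only a hypothetical sketch of the general idea of token probability marginalization: per-document next-token distributions are mixed according to document relevance weights. All names and values are illustrative.

```python
# Hypothetical sketch of token-probability marginalization over a document cluster
# (generic idea only; not the actual DAMEN architecture).
import numpy as np

def marginalize_next_token(per_doc_probs, doc_weights):
    """Combine per-document next-token distributions into a single distribution.

    per_doc_probs: shape (n_docs, vocab_size), each row p(token | prefix, doc d).
    doc_weights:   shape (n_docs,), relevance scores p(d | cluster) from a
                   discriminator (assumed here); must sum to 1.
    """
    return doc_weights @ per_doc_probs  # p(token) = sum_d p(d) * p(token | d)

vocab_size, n_docs = 5, 3
rng = np.random.default_rng(0)
per_doc = rng.random((n_docs, vocab_size))
per_doc /= per_doc.sum(axis=1, keepdims=True)   # normalize each row
weights = np.array([0.6, 0.3, 0.1])             # hypothetical relevance weights
marginal = marginalize_next_token(per_doc, weights)
print(round(float(marginal.sum()), 6))          # -> 1.0 (still a distribution)
next_token_id = int(marginal.argmax())          # greedy decoding step
```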

    Text-to-Text Extraction and Verbalization of Biomedical Event Graphs

    Biomedical events represent complex, graphical, and semantically rich interactions expressed in the scientific literature. Almost all contributions in the event realm revolve around semantic parsing, usually employing discriminative architectures and cumbersome multi-step pipelines limited to a small number of target interaction types. We present the first lightweight framework to solve both event extraction and event verbalization with a unified text-to-text approach, allowing us to fuse all the resources designed so far for different tasks. To this end, we present a new event graph linearization technique and release highly comprehensive event-text paired datasets covering more than 150 event types from multiple biology subareas (English language). By recasting parsing and generation as translation tasks, we report baseline transformer model results on multiple biomedical text mining benchmarks and NLG metrics. Our extractive models achieve state-of-the-art performance, surpassing single-task competitors, and show promising capabilities for the controlled generation of coherent natural language utterances from structured data.
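
    The paper's actual linearization notation is not given in the abstract; the following sketch merely illustrates the general idea of flattening an event graph (including nested events) into a text sequence that a text-to-text model could consume, using an invented bracketed format.

```python
# Invented linearization of a biomedical event graph into a flat string
# (the actual notation used by the paper is not shown in the abstract).
def linearize_event(event_id, events):
    event = events[event_id]
    parts = [f"[{event['type']}] {event['trigger']}"]
    for role, arg in event["args"]:
        if arg in events:  # the argument is itself an event: recurse into the subgraph
            parts.append(f"<{role}> ( {linearize_event(arg, events)} )")
        else:              # the argument is a plain entity mention
            parts.append(f"<{role}> {arg}")
    return " ".join(parts)

events = {
    "E1": {"type": "Phosphorylation", "trigger": "phosphorylates",
           "args": [("Theme", "STAT3")]},
    "E2": {"type": "Positive_regulation", "trigger": "induces",
           "args": [("Cause", "IL-6"), ("Theme", "E1")]},
}
print(linearize_event("E2", events))
# [Positive_regulation] induces <Cause> IL-6 <Theme> ( [Phosphorylation] phosphorylates <Theme> STAT3 )
```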

    Personalized Web Search via Query Expansion based on User’s Local Hierarchically-Organized Files

    Users of Web search engines generally express information needs with short and ambiguous queries, leading to irrelevant results. Personalized search methods improve the user experience by automatically reformulating queries before sending them to the search engine, or by rearranging the received results, according to the user's specific interests. A user profile is often built from previous queries, clicked results, or the user's browsing history in general; different topics must be distinguished in order to obtain an accurate profile. It is quite common that a set of user files, locally stored in sub-directories, is organized by the user into a coherent taxonomy corresponding to his or her own topics of interest, but only a few methods leverage this potentially useful source of knowledge. We propose a novel method in which a user profile is built from those files, specifically considering their consistent arrangement in directories. A bag of keywords is extracted for each directory from the text documents within it. We can then infer the topic of each query and expand it by adding the corresponding keywords, in order to obtain a more targeted formulation. Experiments are carried out on benchmark data through a repeatable systematic process, in order to objectively evaluate how much our method can improve the relevance of query results when applied on top of a third-party search engine.
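
    As a rough sketch of the general idea, assuming invented directory names and keyword bags: match the query against per-directory keyword profiles and expand it with the top keywords of the best-matching directory. This is not the paper's exact profiling or expansion procedure.

```python
# Rough sketch: per-directory keyword bags, cosine matching, query expansion.
from collections import Counter
import math

profile = {  # directory -> bag of keywords extracted from its text documents (invented)
    "finance/reports": Counter({"stock": 5, "market": 4, "dividend": 2}),
    "hobby/astronomy": Counter({"telescope": 6, "mars": 3, "orbit": 2}),
}

def cosine(a, b):
    # Cosine similarity between two term-frequency bags.
    dot = sum(a[t] * b[t] for t in a)
    return dot / (math.sqrt(sum(v * v for v in a.values())) *
                  math.sqrt(sum(v * v for v in b.values())) or 1.0)

def expand(query, profile, n_terms=2):
    # Infer the query topic as the most similar directory, then append its top keywords.
    q = Counter(query.lower().split())
    best_dir = max(profile, key=lambda d: cosine(q, profile[d]))
    extra = [t for t, _ in profile[best_dir].most_common(n_terms) if t not in q]
    return query + " " + " ".join(extra)

print(expand("market crash", profile))  # -> "market crash stock"
```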

    A Probabilistic Approach to the Drag-Based Model

    The forecast of the time of arrival of a coronal mass ejection (CME) at Earth is of critical importance for our high-technology society and for any future manned exploration of the Solar System. As critical as the forecast accuracy is the knowledge of its precision, i.e. the error associated with the estimate. We propose a statistical approach to the computation of the time of arrival using the drag-based model, introducing probability distributions, rather than exact values, as input parameters, thus allowing the evaluation of the uncertainty on the forecast. We test this approach on a set of CMEs whose transit times are known, and obtain extremely promising results: the average value of the absolute differences between measurement and forecast is 9.1 h, and half of these residuals are within the estimated errors. These results suggest that this approach deserves further investigation. We are working on a real-time implementation that ingests the outputs of automated CME tracking algorithms as inputs, to create a database of events useful for further validation of the approach.
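
    A minimal Monte Carlo sketch of the general approach, not the authors' implementation: sample the drag-based-model inputs (initial CME speed, solar wind speed, drag parameter) from assumed distributions and propagate each sample to 1 AU, obtaining a distribution of transit times. All parameter values and distributions below are illustrative.

```python
# Monte Carlo sketch of a probabilistic drag-based model (DBM) forecast.
# Parameter distributions and values are illustrative, not the authors' settings.
import numpy as np

AU_KM = 1.496e8           # Sun-Earth distance [km]
R0_KM = 20 * 6.957e5      # starting heliocentric distance: 20 solar radii [km]

def dbm_transit_time(v0, w, gamma, dt=600.0):
    """Propagate one CME sample to 1 AU; return the transit time in hours.

    Drag-based model: dv/dt = -gamma * (v - w) * |v - w|,
    with v0 the initial CME speed, w the solar wind speed [km/s],
    and gamma the drag parameter [1/km].
    """
    r, v, t = R0_KM, v0, 0.0
    while r < AU_KM:
        v -= gamma * (v - w) * abs(v - w) * dt   # drag deceleration toward w
        r += v * dt
        t += dt
    return t / 3600.0

rng = np.random.default_rng(42)
n = 500
v0 = rng.normal(1000.0, 100.0, n)       # initial speed [km/s]
w = rng.normal(400.0, 50.0, n)          # solar wind speed [km/s]
gamma = rng.normal(0.2e-7, 0.05e-7, n)  # drag parameter [1/km]

times = np.array([dbm_transit_time(*s) for s in zip(v0, w, gamma)])
print(f"arrival time: {times.mean():.1f} h +/- {times.std():.1f} h")
```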

    Computational analysis and management of transcribed sequences in non-model organisms

    Academic year 2011/2012, XXV Cycle. The main topic of this thesis is the discussion of the methods that, through the use of ad-hoc tools and third-party software, made it possible to analyze transcribed sequences from five non-model organisms: Mytilus galloprovincialis, Ruditapes philippinarum, Latimeria menadoensis, Astacus leptodactylus, and Procambarus clarkii.

    Learning to Predict the Stock Market Dow Jones Index Detecting and Mining Relevant Tweets

    Stock market analysis is a primary interest for finance and a challenging task that has always attracted many researchers. Historically, this task was accomplished by means of trend analysis, but in recent years text mining has emerged as a promising way to predict stock price movements. Indeed, previous works showed not only a strong correlation between financial news and their impact on the movements of stock prices, but also that the analysis of social network posts can help to predict them. These latest methods are mainly based on complex techniques to extract the semantic content and/or the sentiment of social network posts. Differently, in this paper we describe a method to predict the Dow Jones Industrial Average (DJIA) price movements based on simpler mining techniques and text similarity measures, in order to detect and characterise relevant tweets that lead to increments and decrements of the DJIA. Considering the high level of noise in social network data, we also introduce a noise detection method based on a two-step classification. We tested our method on 10 million Twitter posts spanning one year, achieving an accuracy of 88.9% in the daily Dow Jones prediction, which is, to the best of our knowledge, the best result among literature approaches based on social networks.
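
    A loose sketch of relevance filtering by text similarity, assuming invented reference texts and an arbitrary threshold; this is not the paper's actual two-step classification pipeline.

```python
# Loose sketch: score tweets by TF-IDF cosine similarity to a few finance-related
# reference texts and keep only the ones above a relevance threshold.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

reference = [  # hypothetical texts characterising tweets that move the DJIA
    "dow jones index drops on weak earnings and rate fears",
    "stocks rally as dow jones industrial average hits record high",
]
tweets = [
    "dow jones sinks 300 points after weak earnings reports",
    "just had the best pizza of my life",
]

vectorizer = TfidfVectorizer().fit(reference + tweets)
sims = cosine_similarity(vectorizer.transform(tweets),
                         vectorizer.transform(reference)).max(axis=1)

THRESHOLD = 0.1  # arbitrary relevance cutoff for this toy example
relevant = [t for t, s in zip(tweets, sims) if s >= THRESHOLD]
print(relevant)  # the finance tweet passes, the noise tweet is filtered out
```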